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Abstract 

Background: Genomic datasets generated by new technologies are increasingly prevalent in disparate areas of 
biological research. While many studies have sought to characterize relationships among genomic features, 
commensurate efforts to characterize relationships among biological samples have been less common. 
Consequently, the full extent of sample variation in genomic studies is often under-appreciated, complicating 
downstream analytical tasks such as gene co-expression network analysis. 

Results: Here we demonstrate the use of network methods for characterizing sample relationships in microarray 
data generated from human brain tissue. We describe an approach for identifying outlying samples that does not 
depend on the choice or use of clustering algorithms. We introduce a battery of measures for quantifying the 
consistency and integrity of sample relationships, which can be compared across disparate studies, technology 
platforms, and biological systems. Among these measures, we provide evidence that the correlation between the 
connectivity and the clustering coefficient (two important network concepts) is a sensitive indicator of 
homogeneity among biological samples. We also show that this measure, which we refer to as cor{K,Q, can 
distinguish biologically meaningful relationships among subgroups of samples. Specifically, we find that cor{K,Q 
reveals the profound effect of Huntington's disease on samples from the caudate nucleus relative to other brain 
regions. Furthermore, we find that this effect is concentrated in specific modules of genes that are naturally 
co-expressed in human caudate nucleus, highlighting a new strategy for exploring the effects of disease on sets of 
genes. 

Conclusions: These results underscore the importance of systematically exploring sample relationships in large 
genomic datasets before seeking to analyze genomic feature activity. We introduce a standardized platform for this 
purpose using freely available R software that has been designed to enable iterative and interactive exploration of 
sample networks. 

Keywords: Sample networks, Sample network analysis, Huntington's disease, Clustering coefficient, cor(K,C), 
Standardized C(k) curve, Data pre-processing, Microarrays, Gene expression 



Background 

Genomic studies capture an enormous amount of informa- 
tion about the molecular organization of biological sys- 
tems. Understanding this organization poses a challenge 
for biologists. In most genomic studies, the number of fea- 
tures (gene expression levels, methylation status, protein 
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abundance, etc.) far exceeds the number of biological sam- 
ples under investigation. Consequently, while network 
methods are often used to illuminate patterns among pair- 
wise relationships of genomic features, the rich information 
contained in the connectivity patterns among samples 
remains comparatively untapped. However, patterns of co- 
variation in genomic feature activity ultimately reflect het- 
erogeneity among biological samples. It is therefore critical 
to understand the extent of sample heterogeneity before 
analyzing genomic feature activity, and whenever possible 
to relate sample heterogeneity to known sample traits, 
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which may include both biological and technical sources of 
variation. In practice, biologists' efforts to explore sample 
relationships in genomic studies are an integral component 
of data pre-processing, yet they are often performed in a 
perfunctory fashion using platform-specific and qualitative 
criteria. 

A popular approach for exploring sample relationships 
is cluster analysis. Cluster analysis is appealing for its in- 
tuitive nature, and is typically used for sample outlier de- 
tection, identification of globally distinct subgroups of 
samples, and identification of distinct subgroups of sam- 
ples using pre-selected lists of features (e.g. genes, vox- 
els, etc.) [1-4]. Although widely used, cluster analysis 
suffers from several shortcomings that are often under- 
appreciated by biologists. Besides depending on the 
measure used to quantify similarities among samples, 
the results of cluster analysis can depend heavily on the 
specific clustering algorithm that is employed. For ex- 
ample, dendrograms produced by hierarchical clustering 
algorithms acting on the same data may look quite differ- 
ent depending on whether single, average, or complete 
linkage is used to calculate distances between clusters 
[2,5,6]. Other clustering procedures may involve additional 
parameter choices that can have a substantial effect on 
cluster assignments (e.g. the choice of k in /c-means cluster- 
ing) [1,5]. Finally, cluster analysis can be impractical for 
very large datasets, in which the sheer number of samples 
obscures the organization and characteristics of a dendro- 
gram and produces ambiguous cluster boundaries. 

In this study we explore alternative means of describing 
sample relationships in topological terms by transforming 
a (dis-)similarity matrix into a network adjacency matrix. 
Our correlation-based sample network can be interpreted 
as a polynomial kernel, which implies that the symmetric 
adjacency matrix is positive semi-definite. Many methods 
exist to address the challenge of mapping biological and 
genomic information to kernel matrices [7,8]. Kernel meth- 
ods involving genomic similarity measures are the basis of 
many statistical analytic methods such as nonparametric 
regression, mixed models, hierarchical regression models, 
score statistics, and support vector machines [9]. Our pri- 
mary approach in this study uses a signed weighted correl- 
ation network, since the resulting kernel i) works well in 
practice, as shown in our applications, and ii) allows for a 
geometric interpretation of network concepts [10]. 

The approach we describe here is a useful complement 
to cluster analysis, but does not actually require that cluster 
analysis be performed. A novel feature of our approach is 
that we show how distinctions among subgroups of sam- 
ples can be identified using topological measures (both glo- 
bally and for subsets of genes), which are based on 
network concepts. Network concepts include the connect- 
ivity (which quantifies the strength of each nodes connec- 
tions with its neighbors) and the clustering coefficient 



(which quantifies the strength of each nodes neighbors' 
connections with each other) [11]. The definitions of these 
and many other important network concepts are reviewed 
below and elsewhere [10,12,13]. 

We illustrate our approach using microarray data gen- 
erated from multiple human brain regions of control 
(CTRL) subjects and patients with Huntington's disease 
(HD) [14]. HD is a progressive and incurable neurode- 
generative disorder characterized by preferential destruc- 
tion of medium spiny neurons in the striatum [15] and 
caused by a CAG-repeat expansion in the coding region 
of the huntingtin gene, which is thought to confer a 
toxic gain-of-function to the mutant huntingtin protein 
[16]. Alterations in gene expression are considered a 
central feature of HD pathology, and the extent to which 
specific gene expression changes precede disease path- 
ology is an area of active investigation [14,17-20]. Our 
results indicate that HD exerts a profound effect on 
sample network topology in the caudate nucleus relative 
to other (less affected) brain regions. Specifically, we find 
that the relationship between the standardized sample con- 
nectivity and the standardized sample clustering coefficient 
follows a simple scaling law in unaffected brain regions, 
but undergoes a sharp transition for HD caudate nucleus 
samples that reflects the degradation of sample correlation 
network structure in this brain region. By restricting sam- 
ple network construction to modules (subsets) of genes 
that are naturally co-expressed in human caudate nucleus 
[21], we find that this degradation is most significant in a 
neuronal signal transduction module. Our findings demon- 
strate that sample networks can enhance the results of 
cluster analysis not only with respect to relatively simple 
tasks such as outlier identification, but also with respect to 
more complex challenges such as group comparisons. 

Results 

The approach we describe in this study formalizes and 
expands upon a strategy that has previously been used to 
identify outlying samples in microarray data generated from 
human brain tissue [21]. Our approach is applicable when- 
ever a dissimilarity or similarity measure can be defined be- 
tween samples (see Additional file 1). A major advantage of 
defining a network adjacency measure between samples (as 
opposed to a general similarity measure) is that it permits 
specification of network concepts. In our implementation, 
we define adjacencies among samples as signed weighted 
correlations with values that approximate the underlying 
correlations when these correlations are large, as is usually 
the case in sample networks (Methods). A signed weighted 
correlation network is attractive since it preserves sign in- 
formation, is robust with respect to the soft threshold 
(power) parameter (p), and preserves the continuous nature 
of correlations (i.e. the result is a fully connected network 
in which all nodes are neighbors with one another) [22]. In 
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addition, a signed correlation network is equivalent to a 
network based on the Euclidean distance between scaled 
vectors (as described in Additional file 1). 

Dataset 

The proposed framework for sample network explor- 
ation (Methods) was used to analyze microarray data 
from "the HD study" [14], These data were generated 
from brain samples of patients with HD (n = 44 individuals) 
and unaffected controls (n = 36 individuals, matched for 
age and sex) [14]. The authors of this study used Affyme- 
trix U133A microarrays to survey gene expression in caud- 
ate nucleus (CN), cerebellum (CB), primary motor cortex 
(Brodmanns area 4; BA4), and prefrontal cortex (Brod- 
manns area 9; BA9) in the CTRL group and across five 
grades of HD severity, which were scored between 0 (least 
severe) and 4 (most severe) using Vonsattels neuropatho- 
logical criteria [23]. HD causes extensive neurodegenera- 
tion in the CN, where medium spiny neurons are 
preferentially destroyed in early stages of the disease 
[15,23]; comparatively, the other analyzed brain regions are 
relatively spared. In addition to disease status and severity, 
sample information included age, sex, the country where 
the experiment was performed (samples were processed in 
the United States and New Zealand), and the microarray 
hybridization batch (Additional file 2) [14]. In light of these 
myriad biological and technical sources of variation, this 
dataset presents a challenging analytical task. 

A motivational example 

Below we provide an example that illustrates how net- 
work concepts can be used to distinguish samples when 
hierarchical clustering cannot. Figure 1A depicts a subset 
of samples from BA9 of CTRL subjects from the HD study. 
As seen in this example, visual inspection of the dendro- 
gram is sufficient to discern the outlying sample 
(BA9_91_C). However, it is illustrative to consider an alter- 
native depiction of sample relationships using the network 
concept of standardized connectivity. Standardized con- 
nectivity {Z.K; Methods) is a quantity that describes the 
overall strength of connections between a given node and 
all of the other nodes in a network. As seen in Figure 1C, 
the standardized connectivity of sample BA9_91_C is sig- 
nificantly lower than all of the other samples, confirming 
its status as an outlier in the group. It is important to note, 
however, that the distribution of standardized connectiv- 
ities is independent of the choice or use of clustering 
procedures. 

Figure IB shows the dendrogram produced by hier- 
archical clustering of another subset of samples from the 
HD study (CB of CTRL subjects). Here the dendrogram 
is more complex, with at least two samples (CB_80_C 
and CB_H123_C) that appear to be outliers, and others 
that are questionable. If the same samples are depicted in 



terms of Z.K (Figure ID), it is evident that three samples 
(CB_80_C, CB_H123_C, and CB_67_C) have Z.K values 
that are significantly lower than the other samples in the 
group. However, note that CB_H110_C, which is indistin- 
guishable from CB_67_C in the dendrogram above 
(Figure IB), has much higher Z.K than CB_67_C, indicat- 
ing that CB_67_C is an outlier whereas CB_H110_C is 
not. By establishing a threshold (e.g. Z.K= -2), standar- 
dized connectivity distributions can be used in a quantita- 
tive and unbiased fashion to identify and remove outlying 
samples, which may reflect hidden factors that can influ- 
ence the results of genomic experiments [24] (this approach 
is particularly useful when the number of samples is large, 
making it difficult to distinguish outlying samples in a den- 
drogram). Analogously, one can also make use of other net- 
work concepts as described below. 

Degradation of sample network topology in caudate 
nucleus by Huntington's disease 

We used the SampleNetwork R function to process all 201 
samples from the HD study simultaneously. As seen in 
Figure SI (Additional file 1) and our R tutorial (Additional 
file 3 and http://www.genetics.ucla.edu/labs/horvath/Coex- 
pressionNetwork/SampleNetwork), we observed a domin- 
ant effect of brain region on gene expression that was 
driven largely by the fact that gene expression in each non- 
cortical (CN and CB) brain region was quite distinct from 
gene expression in cortical (BA4 and BA9) brain regions, as 
has been described previously [25-28]. In light of the strong 
effect of brain region on gene expression, as well as the fact 
that HD preferentially targets CN relative to the other ana- 
lyzed brain regions, we next used SampleNetwork to exam- 
ine samples from each brain region separately. Within each 
brain region, we analyzed CTRL and HD samples as a sin- 
gle cohort, but note that alternative strategies (e.g., analyz- 
ing CTRL and HD samples as separate cohorts) may be 
desirable, depending on the downstream application. 

After constructing sample networks for each brain re- 
gion (as described in Additional file 3), we examined the 
relationship between the standardized sample connectiv- 
ity (Z.K) and the standardized sample clustering coeffi- 
cient (Z.C) for all samples in each brain region. We refer 
to this relationship as the standardized C(k) curve. As 
discussed below, (unstandardized) C(k) curves have been 
used to study the topological properties of scale -free net- 
works and other large complex networks [29-32]. We 
propose using the Spearman correlation to measure the 
standardized C(k) curve since it is invariant with regard 
to monotonically increasing transformations. In particu- 
lar, the Spearman correlation between Z.K and Z.C 
equals that of the unstandardized measures, which is 
why we denote it simply by cor(K,C) (Methods). In the 
following, we will demonstrate that the standardized C 
(k) curve is a valuable tool for i) assessing the overall 
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Figure 1 Network concepts provide a natural framework for describing relationships among samples in high-dimensional biological 
datasets. A motivational example. (A) Dendrogram produced by average linkage hierarchical clustering using 1 - ISA (intersample adjacency) for 
a subset of samples (prefrontal cortex [BA9] of CTRL subjects) from ref. [14]. (B) Dendrogram produced by average linkage hierarchical clustering 
using 1 - ISA for another subset of samples (cerebellum [CB] of CTRL subjects) from ref. [14]. (C) Standardized sample connectivities (Z./O provide 
a different view of the BA9 CTRL samples. BA9_91_C (red) exhibited significantly lower connectivity than the other samples in this group, 
consistent with the dendrogram (A). (D) Standardized sample connectivities for the CB CTRL samples. Three samples (CB_80_C, CB_H123_C, and 
CB_67_C, in red) had Z.K values that were significantly lower than the others. Note that CB_67_C had much lower connectivity than CB_H1 10_C 
(blue), yet these two samples were indistinguishable in the dendrogram above (B). Black horizontal lines in (C) and (D) correspond to an optional 
Z.K threshold (here -2) for outlier removal; CTRL = control. 



consistency of sample behavior within a dataset, ii) iden- 
tifying distinct groups of samples, and iii) identifying im- 
portant subsets of features (e.g. genes). 

For samples from prefrontal cortex (Figure 2A), motor 
cortex (Figure 2B), and cerebellum (Figure 2C), we observed 
that Z.K and Z.C formed nearly perfect inverse relation- 
ships, with no obvious distinctions between CTRL and HD 
subjects. In contrast, samples from the caudate nucleus 
exhibited clear segregation according to diagnosis, with 
CTRL and HD subjects forming two distinct groups 
(Figure 2D). This segregation indicates that cor(K,C) is a 
useful network concept that measures an important aspect 
of the global architecture in weighted sample networks. 
Interestingly, cor(K,C) for HD CN samples differed when 
brain regions were analyzed together (cor(K,C) - 0.77, 
P = 1.7e-08; Figure SID; Additional file 1) and when they 
were analyzed apart {cor{K,Q = -0.78, P = 4.0e-08; Figure 2D), 
suggesting that the relationship between the node-based 
measures Z.K and Z.C depends upon properties of the net- 
work as a whole, a topic that has been the subject of recent 
investigations [33]. 

Understanding the properties of the standardized C(k) 
curve 

As discussed below, the C(k) curve has been studied pri- 
marily in biological networks in which nodes correspond to 



gene products [30,32]. In contrast to the negative relation- 
ship observed in sample networks (Figure 2), we observed 
that Z.K and Z.C tended to exhibit a positive relationship in 
gene-based networks (e.g. Figure S2A,B; Additional file 1). 
A positive relationship was observed for genes that are 
naturally co-expressed in human caudate nucleus [21] {cor 
{K,Q = 0.7, P<2.2e-16; Figure S2A,C; Additional file 1), as 
well as for genes that were selected at random (cor 
(7C,C) = 0.83, P<2.2e-16; Figure S2B,D; Additional file 1). 
To understand why cor{K,C) is often positive in gene- 
based networks but negative in sample networks, consider 
that in most microarray studies, and in particular when 
analyzing similar biological specimens, samples are highly 
correlated with one another (e.g. r > 0.95 when measured 
across all genes). In contrast, most genes exhibit moderate 
to weak correlations with other genes, such that the mean 
correlation in a typical gene co-expression network is 
close to 0 and follows an approximately normal distribu- 
tion (e.g. Figure S2D; Additional file 1). Even for a module 
of co-expressed genes, when compared with sample net- 
works, the distribution of pairwise correlations is shifted 
towards smaller values (e.g. Figure S2C; Additional file 1). 
Therefore, we hypothesized that the contrasting relation- 
ships between Z.K and Z. C in sample networks and gene 
networks might relate to differences in the global topo- 
logical organization of each network. 
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Figure 2 Sample network concepts reveal the profound effect of Huntington's disease in caudate nucleus. Comparison of standardized 
sample connectivities {Z.K) and standardized clustering coefficients (ZQ between control subjects (CTRL) and subjects with Huntington's disease (HD) 
in prefrontal cortex (A; n = 9 CTRL and 1 6 HD), motor cortex (B; n = 1 6 CTRL and 1 4 HD), cerebellum (C; n = 23 CTRL and 34 HD), and caudate nucleus 
(D; n = 31 CTRL and 35 HD). Networks were constructed over all probe sets (n = 18,631) using all samples (CTRL and HD) from each brain region. 



To test this hypothesis, we conducted a simulation study 
to explore the properties of cor(K,C) by systematically vary- 
ing the network topology (mean node adjacency) and net- 
work size (number of nodes). For simulated networks with 
low mean node adjacency (i.e. mostly weak connections 
among nodes, like most gene co-expression networks), we 
observed values of cor(K,C) approaching 1 (Figure 3), indi- 
cating a nearly perfect positive linear relationship between 
Z.K and Z. C. As the strength of connections among nodes 
(i.e. mean node adjacency) began to increase, cor(K,C) 
began to shift, while also revealing a dependence on net- 
work size (i.e. number of nodes; Figure 3). This shift accel- 
erated dramatically as simulated networks began to consist 
of mostly strong connections among nodes, producing a 
"waterfall" effect reminiscent of a percolation transition 
[33] (Figure 3). When simulated networks possessed very 
high mean node adjacency (like most sample networks), 
cor(K,C) approached -1 (Figure 3), indicating a nearly per- 
fect negative linear relationship between Z.K and ZC. 

Collectively, these observations suggest that the divergence 
of cor(K,Q for HD CN samples relative to CTRL samples 
and other brain regions (Figure SID [Additional file 1], 
Figure 2D) reflects a degradation of global sample net- 
work topology in CN by HD. To visualize this degrad- 
ation more directly, we compared the distributions of 
pairwise sample adjacencies between CTRL and HD 



subjects for each brain region. The distributions of sample 
adjacencies exhibited the greatest difference between 
CTRL and HD subjects in CN, where HD sample adjacen- 
cies were markedly degraded (Figure S3; Additional file 1). 
Thus, degradation of global sample network topology by 
HD in CN has shifted cor(K,C) for HD CN samples. This 
relationship has begun to invert (i.e. it is "in the waterfall" 
[Figure 3]), indicating that HD has initiated a percolation- 
like transition in the global network topology of CN 
samples. 

Sample network topology reveals strong effects of 
Huntington's disease on specific gene co-expression 
modules in human caudate nucleus 

The degradation of global sample network topology by 
HD in CN (Figures SID, 2D, S3) was observed across all 
analyzed probe sets (n = 18,631). We hypothesized that 
this effect might vary for specific subsets of genes 
involved in disparate biological processes, which in turn 
might implicate specific biological processes in connec- 
tion with HD pathology. By focusing on pre-selected gene 
sets (informally referred to as modules), we illustrate below 
how the standardized C(k) curve can be used to identify 
clinically important subsets of features (i.e. genes). Toward 
this end, we make use of a second R function called 
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Figure 3 cor{K,Q depends upon network topology and network 
size. The Spearman correlation {cor{K,Q; z-axis) between the 
connectivity and the clustering coefficient as a function of network 
density (mean node adj. [adjacency]; x-axis) and network size (nodes; 
y-axis). Signed networks ((3 = 2) were simulated using the 
simulateModule function from the WGCNA R package [34]. The seed 
module eigengene (ME) consisted of 5,000 random, normally 
distributed features (mean = 0, sd = 1). The function parameters 
"corPower" and "propNegativeCor" were set to 0.75 and 0, respectively. 
The function parameter "minCor" was iteratively reduced from .95 to .05 
by increments of .05, progressively degrading the strength of node 
connections; for each iteration, cor{K,Q was calculated for module 
networks of various sizes (n = 1 0 to 1 00, by = 1 0). 



ModuleSampleNetwork (and refer to the resulting sample 
networks as "module sample networks"). 

We have previously shown that the transcriptome of 
normal human CN is organized into modules of co- 
expressed genes, many of which relate to specific cell types 
and functional processes [21]. For example, gene co- 
expression modules corresponding to oligodendrocytes, 
astrocytes, neurons, mitochondrial function, synaptic 
function, immune response, gender differences, and the 
subventricular neurogenic niche have been described in 
human CN [21]. Subsequent work in rodents has con- 
firmed that striatal gene co-expression network architec- 
ture is robust across disparate strains of mice [35]. The 
inherent organization of the CN transcriptome provides a 
natural framework in which to study the effects of HD on 
sample network topology. Therefore, we sought to deter- 
mine the extent to which variation in sample network top- 
ology was associated with particular gene co-expression 
modules in CN. Specifically, we constructed sample net- 
works in CN for each of the 23 gene co-expression mod- 
ules that were previously identified in this brain region in 
humans [21]. The 23 gene co-expression modules are 



labeled by colors (e.g. the "palegreen" module), with pertin- 
ent functional characterizations taken from ref. [21]. 

To assess the effects of HD on module sample network 
topology, we calculated cor{K,C) for CTRL and HD sub- 
jects in every module (Figure 4A). Based upon the relation- 
ship observed between Z.K and Z.C for CTRL and HD 
subjects in BA9, BA4, and CB (Figure 2A-C), we hypothe- 
sized that in the absence of an effect of HD on module 
sample network topology, cor(K,C) CTRL should approxi- 
mately equal cor(K£) HD. In addition, for module sample 
networks characterized by strong connections among 
nodes, we expected cor(K,C) to approach -1 (Figure 3). 
The majority of modules clustered along the diagonal, indi- 
cating relative preservation of cor{K,C) between CTRL and 
HD subjects; however, a handful of modules were clearly 
distinguished as outliers (Figure 4A). Among the outliers, 
the difference in cor{K,C) between CTRL and HD subjects 
was most significant for the salmon module (M8C), fol- 
lowed by the black (M11C; Figure S4; Additional file 1), 
royalblue (M36; Figure S5; Additional file 1), and red 
(M19C; Figure S6; Additional file 1) modules (Figure 4B). 
These results indicate that cor{K,C) is a useful measure for 
highlighting differences in sample network topology among 
subsets of genes. 

In the original HD study [14], the authors determined 
that a large fraction (-20%) of transcripts showed differen- 
tial expression (DE) in post-mortem CN between CTRL 
and HD subjects. DE in HD is thought to reflect both cell- 
intrinsic changes in gene expression (i.e. changes in gene 
expression induced by the mutant huntingtin protein), as 
well as changes at the cellular population level due to neur- 
onal cell death and subsequent astrogliosis [14,17,20]. In 
light of such widespread changes, we asked whether par- 
ticular gene co-expression modules were associated with 
DE. As shown in Figure 4C, many modules were signifi- 
cantly associated with DE. This result is perhaps not sur- 
prising, inasmuch as cellular stoichiometry is altered by 
HD and many modules have been shown to be enriched 
with cell type-specific genes [21]. We next sought to relate 
the extent of modular DE with the extent of modular deg- 
radation in sample network topology. As shown in 
Figure 4D, the salmon module was the most significant in 
both of these dimensions, followed by the black and royal- 
blue modules. Overall, however, the relationship between 
these two measures was weak (r = 0.41, P = 5.2e-02). In- 
deed, one module (red) exhibited a very significant differ- 
ence in cor{K,C) between CTRL and HD subjects, with no 
significant evidence of differential expression (Figure 4D). 

cor(K,Q can distinguish sample groups in the absence of 
differential expression 

To explore the basis for this observation, we conducted a 
simulation study to determine whether cor(K,Q could dis- 
tinguish subsets of samples in the absence of differential 
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Figure 4 Huntington's disease exerts strong effects on specific gene co-expression modules in human caudate nucleus. Analysis of 
human caudate nucleus (CN) sample network properties for each of 23 gene co-expression modules previously identified in CN; colors 
correspond to the original gene co-expression module labels from [21]. (A) For each module sample network, the Spearman correlations cor{K,Q 
are plotted for control (CTRL) and Huntington's disease (HD) subjects. Each point corresponds to a module. Black line: y = x. (B) The log- 
transformed P-value of the difference between cor{K,Q for CTRL and HD subjects is reported for each module (Methods). (C) The extent of 
differential expression (DE) between CTRL and HD was assessed for each module by using Student's t-test of DE for the module eigengene (ME; 
i.e. the first principal component obtained by singular value decomposition of the module expression matrix) between CTRL and HD. (D) 
Comparison of the module significance levels reported in (B) and (C); linear least squares regression line in black. p.Diff.cor(K,C) denotes the 
P-value for testing the differences of cor{K,Q between the CTRL and HD module sample networks. (B-D) Blue lines: P=.05; red lines: Bonferroni 
correction for multiple comparisons. 



expression. Specifically, we simulated a set of 500 genes 
and 100 samples (referred to as a "module"), using the real 
structure of the red module as an approximate guide 
(Methods). Samples were assigned to one of three groups 
using a simulated sample trait (referred to as "disease sta- 
tus"), with 50 samples corresponding to control status, 25 
samples corresponding to moderate disease status, and 25 
samples corresponding to severe disease status (Methods). 
The simulation model assumed i) that 60% of the module 
genes were not related to the disease and ii) that these 
noise genes had lower mean values than the 40% of (signal) 
genes that were down-regulated by the disease. Figure 5A 
depicts the dendrogram produced by hierarchical cluster- 
ing of sample adjacencies for the simulated module. As 
seen in Figure 5B, the observed module eigengene was not 



related to disease status (P = 0.18, Kruskal-Wallis test). In 
contrast, cor{K,C) clearly delineated the control samples 
from the affected samples (Figure 5C), despite inconsistent 
gene expression differences among the three sample 
groups (Figure 5D). These results provide further evidence 
that cor{K,C) can distinguish meaningful groups of samples 
in certain situations where differential expression analysis 
cannot. 

A neuronal signal transduction module is profoundly 
degraded by Huntington's disease 

Figure 6 depicts the results of sample network construction 
for the CN salmon module (similar depictions for the 
black, royalblue, and red modules can be found in Figures 
S4, S5, and S6, respectively). Hierarchical clustering of 
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Figure 5 cor{K,C) distinguishes sample subgroups in the absence of differential expression. Analysis of a simulated gene expression 
module consisting of 500 genes and 100 samples. Samples were assigned to one of three subgroups based on simulated disease status: "control" 
(n = 50; darkgreen), "moderate" (n = 25; red), or "severe" (n = 25; turquoise) (Methods). (A) Average linkage hierarchical clustering of samples using 
1 - ISA (intersample adjacency) as a dissimilarity measure. (B) Distributions of module eigengene (ME) values by sample subgroup. Note that 
these distributions are not significantly different (P = 0.18, Kruksal-Wallis test), indicating that there is no differential expression associated with 
disease status at the modular level. (C) When depicted in terms of Z.K and ZC, control and affected subjects segregated into two distinct groups 
(linear least squares regression lines in black [control] and red [affected]). (D) Heat map of simulated gene expression levels. Rows correspond to 
genes and columns correspond to samples. Green = low expression; red = high expression. 



sample adjacencies produced a dendrogram with two large 
branches (Figure 6A). The first branch formed a cluster 
comprised exclusively of HD samples (cluster 1), 85% of 
which were Vonsattel grade 2 or higher (i.e. later stages of 
disease progression). The second branch subdivided to 
produce two sample clusters. 91% of the samples in cluster 
2 corresponded to unaffected individuals, with the remain- 
der consisting of grade 1 (n = 2) or grade 0 (n = 1) HD sam- 
ples. Cluster 3 was comprised almost exclusively of HD 
samples, all of which were grade 2 or below (i.e. earlier 
stages of disease progression). 

Examination of the distribution of Z.K among samples in 
the salmon module (Figure 6B) also revealed a distinction 
among grades of HD severity. Grade 1 and a subset of 
grade 2 HD samples possessed Z.K values that were com- 
parable to those of unaffected individuals; however, a ma- 
jority of grade 2 samples and grade 3 samples possessed Z. 
K values that were substantially lower than all other sam- 
ples (Figure 6B). In contrast, examination of Z.C revealed a 
monotonic arrangement of samples, with CTRL > grade 
1 > grade 2 > grade 3 (Figure 6C). When plotted in both of 
these dimensions, samples formed two distinct lines that 



clearly delineated CTRL and HD subjects (Figure 6D). 
Interestingly, three HD samples (two grade 1 and one 
grade 0) fell upon the same regression line as the CTRL 
samples (Figure 6D, black line); these were the same sam- 
ples that belonged to cluster 2 in Figure 6A. It is possible 
that the intermingling of some early stage HD samples 
with CTRL subjects could reflect the continuum of neuro- 
degeneration that spans from normal aging to neurodegen- 
erative disease. We also observed that the distribution of 
HD samples along their regression line tended to reflect 
their grade of severity (Figure 6D, red line). These results 
provide visual confirmation of the significant distinction 
between CTRL and HD subjects in the salmon module 
reported above (Figure 4A,B). In addition, multivariate lin- 
ear regression using the salmon module eigengene (i.e. the 
first principal component of gene expression in the salmon 
module) as outcome confirmed an extremely significant ef- 
fect of diagnosis (Dx) on gene expression in this module, 
as well as significant independent effects for grade and age 
(Figure 6E). The effect of diagnosis on gene expression was 
evident when gene expression in the salmon module was 
visualized directly (Figure 6F). 
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Figure 6 Caudate nucleus samples exhibit significant segregation by diagnosis in gene co-expression module M8C (salmon). Analysis of 
caudate nucleus (CN) sample network properties for genes comprising the CN salmon co-expression module M8C [21]. (A) Average linkage 
hierarchical clustering of samples using 1 - ISA (intersample adjacency) as a dissimilarity measure. Colors denote control (CTRL) subjects 
(darkgreen; n = 31) and Huntington's disease (HD) subjects with varying grades of disease severity: HD grade 0 (black; n = 2), HD grade 1 (red; 
n = 1 1 ), HD grade 2 (turquoise; n = 1 6), HD grade 3 (blue; n = 5), and HD grade 4 (brown; n = 1 ). Standardized sample connectivities (Z/C; B) and 
standardized sample clustering coefficients (ZC; C). (D) HD and CTRL samples segregated into two distinct groups when depicted in terms of Z/C 
and ZC (linear least squares regression line in black [CTRL] and red [HD]). (E) Multivariate linear regression revealed a highly significant effect of 
diagnosis (Dx) on the salmon module eigengene. Blue line: P = .05; red line: Bonferroni correction for multiple comparisons. (F) Heat map of 
expression levels for genes comprising the salmon co-expression module M8C. Rows correspond to probe sets (genes) and columns correspond 
to samples. Green = low expression; red = high expression. Samples in (B-D, F) are colored as in (A). 



As can be seen in Figure 6F, the vast majority of genes 
in the salmon module showed decreased expression 
levels with increasing severity of HD, which would be 
expected as a consequence of neuronal cell death (not- 
withstanding cell-intrinsic changes in gene expression 



induced by the mutant huntingtin protein). When it was 
originally described, the salmon gene co-expression 
module in human CN was found to be enriched with 
genes that are preferentially expressed in neurons, genes 
that encode synaptic proteins, and genes involved in 
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signal transduction [21]. Analyses of differential expres- 
sion, functional enrichment, and membership strength for 
all genes in the salmon module are summarized in 
Additional file 4. To dissociate changes in gene expression 
caused by altered cellular stoichiometry in HD from 
changes in gene expression caused by cell-intrinsic effects 
of the mutant huntingtin protein, we cross-referenced CN 
module composition with a set of genes that has been 
found to be dysregulated in primary neuron models of HD 
[20]. In the study by Runne et al, the effects of mutant 
huntingtin on gene expression were measured before cell 
death in primary striatal neurons cultured from rat brains 
[20]. We observed that the salmon module was significantly 
enriched with this set of dysregulated genes, and more so 
than any other module (Figure S7; Additional file 1). We 
also note that a number of genes in the salmon module 
were previously found to be differentially expressed in 
laser-microdissected striatal neurons of CTRL and HD 
human subjects [14] (Additional file 4). 

Lastly, we used Ingenuity Pathways Analysis (IPA) to de- 
termine whether the salmon module was enriched with 
annotated functional categories of genes. Out of more than 
500 annotated functional categories of genes in the IPA 
database, the two categories that showed the most signifi- 
cant enrichment with genes from the salmon module were 
"dyskinesia" (FDR P= 1.4e-24) and "Huntington's disease" 
(FDR P = 1.6e-24) (Additional file 5). 

Discussion 

To the best of our knowledge, this work provides the 
first formal demonstration that network methods can 
distinguish biologically meaningful relationships among 
samples in genomic datasets. We have shown that sam- 
ple networks can identify outlying samples when hier- 
archical clustering procedures cannot, and even when 
hierarchical clustering procedures are not used at all. 
We have described a novel network statistic, cor(/<T,C), 
and shown that it can be used to i) evaluate sample 
homogeneity, ii) identify sample characteristics (e.g. 
diagnosis) with global effects, and iii) enable compari- 
sons among groups of samples using pre-selected lists of 
features (e.g. gene co-expression modules). By applying 
the latter approach to microarray data generated from 
human brain tissue, we have identified a neuronal signal 
transduction module that is an epicenter of transcrip- 
tional dysregulation in striatal samples from individuals 
with HD. The advantages of using network methods for 
describing sample relationships in genomic datasets are 
summarized below. 

A major advantage of constructing sample networks is 
that individual samples can subsequently be described using 
established node-based network concepts such as the con- 
nectivity and the clustering coefficient. These concepts are 
independent of the choice or use of clustering algorithms 



and depend only on the adjacency measure used to 
construct the network. The distributions of standardized 
node-based network concepts provide an unbiased and 
quantitative framework for identifying samples that "be- 
have" differently, even if the underlying causes of this be- 
havior are unknown. Intuitively, if the connectivity for a 
given sample (when measured over all genes) is significantly 
lower than all other sample connectivities from the same 
biological system, it suggests that there is something differ- 
ent about that sample compared to the others. The investi- 
gator must ask him/herself whether the observed difference 
is likely to reflect biological or technical variation. In light 
of the multiple steps that comprise a typical genomic ex- 
periment, each of which may introduce technical variation, 
a conservative approach is to exclude aberrant samples if 
there are no obvious biological factors that might explain 
their discordant behavior. 

Compared with other methods for identifying outlying 
samples in genomic data, our approach offers several 
additional advantages. First, because sample relation- 
ships are defined with respect to a correlation matrix, it 
is platform-agnostic and does not require access to raw 
data (although in practice it is preferable to process raw 
data in a consistent fashion). Second, it is easily applied 
to very large datasets, in contrast to clustering proce- 
dures that rely upon visual inspection of dendrograms to 
identify outlying samples. Third, it produces a battery of 
measures for summarizing the consistency and integrity 
of genomic datasets (e.g. mean intersample adjacency 
[ISA, or density], decentralization, homogeneity, etc.), 
which can be compared across disparate studies, tech- 
nology platforms, and biological systems. Such measures 
are especially useful for meta-analyses, where objective as- 
sessment of data quality is highly desirable before seeking 
to pool or compare results across studies. Finally, as imple- 
mented in SampleNetwork and described in Additional file 
3, our approach is both flexible and efficient, enabling users 
to move quickly through large datasets in an iterative fash- 
ion, specifying groups of samples for processing, identifying 
and removing outliers, testing the significance of sample 
covariates, and performing data normalization. To enhance 
user-friendliness, we have also incorporated the R function 
ComBat [36], which is an effective tool for removing batch 
effects (Additional file 1). At each stage, relevant output 
files are produced and exported automatically. 

At the same time, there are several important caveats 
associated with our proposed approach for using net- 
work concepts to identify outlying samples in genomic 
data. It should be noted that our approach works best 
for datasets with large numbers of samples (e.g. more 
than 10). It is also important to note that standardized 
network concepts such as Z.K are relative measures 
whose interpretation depends on context. For example, 
in a relatively homogeneous sample network (e.g. mean 
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ISA > 0.97), a Z.K value of -2.5 implies higher adjacen- 
cies for the sample in question than it would in a more 
heterogeneous sample network (e.g. mean ISA < 0.9). In 
light of these considerations, it can be helpful to have 
"targets" in mind, such as an expectation of what the 
mean ISA should approach for a given biological system, 
technology platform, and adjacency measure. These tar- 
gets can be guided by prior experience (for example, 
cancer datasets often exhibit substantial sample hetero- 
geneity) or by the use of technical and biological repli- 
cates. Lastly, although we have focused primarily on Z.K 
and to a lesser extent ZC as intuitive indicators of outly- 
ing status, it is possible that other node-based network 
concepts (or indeed, other measures of adjacency) could 
produce different results. 

Beyond facilitating relatively simple tasks such as outlier 
identification, sample networks provide a novel perspec- 
tive on more complex challenges such as group compari- 
sons. Our results indicate that the standardized C(k) curve 
in weighted sample networks is a powerful tool for identi- 
fying sample characteristics with global effects on genomic 
activity. The stark divergence of cor(K,C) for HD CN sam- 
ples motivated us to explore how cor(K,C) would be 
affected by other network topologies, leading to the 
observation that cor{K,C) undergoes a percolation-like 
transition that is related to network density and size. Al- 
though cor{K,C) was inversely related to network density 
in our simulations, we note that cor(K,C) is invariant if 
one scales all off-diagonal adjacencies by a constant. 
Therefore, it is more accurate to consider cor(K,C) as an 
indicator of network heterogeneity (or homogeneity; 
Additional file 1). In the special situation of an exactly fac- 
torizable network, we find that cor{K,C) is determined by 
the network heterogeneity (Methods). One practical impli- 
cation of these findings is that cor(K,C) can serve as a use- 
ful indicator of data "cleanliness": with each iteration of 
sample outlier removal or data normalization performed 
using SampleNetwork, cor(I<£) should approach -1. 

We note that our findings with respect to the 
percolation-like transition for cor(K,C) are also applic- 
able to unweighted (binary) networks. We have observed 
a similar transition for cor(K,C) in unweighted gene net- 
works as the threshold for dichotomizing the adjacency 
matrix is progressively increased (Figure S8; Additional file 1). 
At permissive (low) thresholds, which produce net- 
works in which most nodes are connected, cor(K,C) 
is negative; as the threshold is raised, producing net- 
works in which most nodes are not connected, the rela- 
tionship begins to invert, becoming positive at more 
stringent (high) thresholds (Figure S8; Additional file 1). 

In unweighted networks, the relationship between the 
(unstandardized) connectivity and (unstandardized) clus- 
tering coefficient of network nodes, i.e. the C(k) curve, has 
previously been reported to follow a scaling law: C = k a 



[29,31]. It has been shown that the value of the scaling ex- 
ponent a is not universal, but negative values approaching 
-1 have been observed in biological systems [30,32]. The 
inverse relationship for the C(k) curve has been interpreted 
as evidence of hierarchical modularity in network structure 
[30,31]. Specifically, it has been suggested that in hierarch- 
ically modular networks, nodes with low connectivity form 
small, densely connected clusters, while nodes with high 
connectivity serve to bridge these many small clusters into 
one large, integrated network [31]. However, the C(k) curve 
has primarily been studied in the context of metabolic, 
protein interaction, and gene regulatory networks, as well 
as other non-biological networks [30,32,37]. 

To the best of our knowledge, a percolation-like transi- 
tion in the C(k) curve has not previously been reported. 
However, prior work has revealed that global topological 
properties of unweighted networks, such as those em- 
bodied in the C(k) curve, can be predicted by knowledge of 
local motif structure, and vice versa [33]. Motifs, or sub- 
graphs, describe basic interaction patterns among small 
groups of nodes [38,39]. In unweighted networks, it has 
been shown that subgraphs naturally segregate into two 
classes: highly abundant type I subgraphs, which are 
sparsely interconnected, and less abundant type II sub- 
graphs, which are densely interconnected [33]. It has also 
been shown that a phase boundary separating type I and 
type II subgraphs can be accurately predicted using global 
network topological properties, including the C(k) curve 
[33]. Therefore, we propose that the transition in the stan- 
dardized C(k) curve observed in our analysis reflects a con- 
comitant transition in local motif structure, which in turn 
reflects the degradation of sample network topology in CN 
by HD. Although motifs have been studied almost univer- 
sally in the context of unweighted networks, we are aware 
of at least one study that has presented an approach for 
generalizing motif scoring to weighted networks [40]. Our 
results suggest that future research investigating the rela- 
tive strengths of distinct motifs in weighted networks and 
their relationship to global network topological properties 
is warranted. 

The effect of HD on the standardized C(k) curve for CN 
samples was initially observed over all genes, which is con- 
sistent with the large impact that HD exerts on the CN 
transcriptome [14,17,19,20,41]. Because the transcriptomes 
of human brain regions, including CN, are organized into 
biologically meaningful gene co-expression modules [21], 
we reasoned that constructing sample networks for previ- 
ously identified CN modules might expose variation in the 
standardized C(k) curve, which in turn might implicate 
specific biological processes in connection with HD path- 
ology [42]. This approach constitutes a novel strategy for 
exploring the effects of disease on sets of genes. We identi- 
fied several modules that exhibited highly significant differ- 
ences in cor{K,C) between CTRL and HD subjects in CN. 
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One potential drawback of our approach is that relatively 
small differences in cor{K,C) can appear significant as \cor 
{KyC)\ approaches 1; for example, M34 was significant 
despite a relatively small difference between CTRL (cor 
(K,C) = -0.98) and HD (cor(K,C) = -0.91) subjects. For the 
four most significant modules, however, the differences in 
cor(K,C) were > 1, indicating that the standardized C(k) 
curve had flipped from negative (CTRL) to positive (HD). 

As illustrated above, differences between standardized 
C(k) curves are not simply a proxy for differences in net- 
work density, but also relate to network size and hetero- 
geneity. We have also observed that small numbers of 
samples that are highly discordant (i.e. severe outliers) can 
have a large impact on the standardized C(k) curve (M.C. 
O. and S.H., unpublished observations). Thus, the standar- 
dized C(k) curve is an aggregate measure, and one that 
may be used to complement existing strategies for con- 
ducting both unsupervised and supervised analyses. We 
also note that in the present study, the overall relationship 
between differential expression (DE) and differences be- 
tween the standardized C(k) curves of CTRL and HD sub- 
jects was weak. For example, although the salmon module 
(which exhibited the most significant difference in cor(K,C) 
between CTRL and HD) was strongly associated with DE, 
the red module (which also exhibited a significant differ- 
ence in cor(K,C) between CTRL and HD) was not. Further- 
more, our simulation study confirms that situations may 
exist in which cor(K,C) can distinguish meaningful sample 
subgroups in the absence of DE. These findings deserve 
additional study. 

Conclusions 

As genomic technologies proliferate and genomic studies 
grow ever larger, it is critical that methods to assess sample 
heterogeneity evolve in parallel. We have presented a stan- 
dardized approach for sample network analysis that can 
detect outlying samples in the absence of hierarchical clus- 
tering. We have also described a novel network statistic, 
cor(K,C), and demonstrated that it can be used to assess 
sample homogeneity, identify sample traits with global 
effects, and facilitate supervised comparisons among 
groups of samples using pre-selected subsets of features. 
Application of the latter approach to microarray data gen- 
erated from human brain tissue identified a neuronal signal 
transduction module as an epicenter of transcriptional dys- 
regulation in striatal samples from individuals with HD. To 
the best of our knowledge, these findings provide the first 
formal demonstration that network methods can distin- 
guish biologically meaningful relationships among samples 
in genomic datasets. The dataset analyzed in this study, 
along with the SampleNetwork and ModuleSampleNetwork 
R functions and a comprehensive tutorial illustrating their 
usage, are available on our web site (http://www.genetics.ucla. 
edu/labs/horvath/CoexpressionNetwork/SampleNetwork). 



Methods 

R software implementation 

We have implemented the sample network approach in a 
freely available, custom R software function called Sample- 
Network. SampleNetwork has been designed to facilitate 
detailed exploration of sample relationships and expedite 
genomic data pre-processing decisions via sample network 
analysis. SampleNetwork enables semi-automatic, inter- 
active sample network construction and network concept 
calculations. Network concepts include node-based mea- 
sures such as the standardized sample connectivity (ZK) 
and the standardized sample clustering coefficient (Z.C), as 
well as network-based measures such as cor(K,C) and 
the mean inter-sample adjacency (ISA, or density). These 
concepts and many others are defined below and in Supple- 
mentary Methods (Additional file 1). By calculating the dis- 
tributions of node-based sample network concepts, 
SampleNetwork enables the user to identify and remove 
outlying samples in an iterative and interactive fashion; by 
calculating network-based sample network concepts, Sam- 
pleNetwork enables the user to gauge overall progress to- 
wards data cleanliness and sample homogeneity. These 
features are described in detail in our online tutorial (see 
below and Additional file 3). SampleNetwork also enables 
significance testing of sample covariates with respect to 
sample metrics, and data normalization. Data normalization 
may be performed pursuant to outlier removal using the 
quantile normalization method proposed in ref. [43]. 

Because sample networks often reveal groupings of sam- 
ples that reflect batch effects (technical variation), which 
are typically not removed by standard normalization pro- 
cedures, we have also incorporated existing methods that 
allow the user to automatically correct for batch effects. 
Specifically, we have found that the R function ComBat 
created by Johnson and colleagues [36] is quite adept at 
removing batch effects. Consequently, if batch effects are 
present, the user has the option of correcting for them by 
calling ComBat from within SampleNetwork, which 
automates its execution. SampleNetwork also requires in- 
stallation of the following R (http://www.r-project.org/) 
and Bioconductor (http://www.bioconductor.org/) packages: 
affy [44], cluster, impute [45], preprocessCore, and WGCNA 
[34], With each successive round of data processing, 
SampleNetwork produces and exports the results of 
sample network analysis automatically (e.g. Figure SI; 
Additional file 1). 

We have also created a companion R software func- 
tion called ModuleSampleNetwork to explore the prop- 
erties of sample networks when formed over subsets of 
features. In our application, subsets of features corres- 
pond to modules of co-expressed genes [21], but we 
note that subsets can be defined by the user according 
to any criteria. ModuleSampleNetwork does not enable 
outlier testing and removal or data normalization, but 
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instead seeks to compare module sample network prop- 
erties between subgroups of samples (e.g. Figure 6) and 
across modules (e.g. Figure 4). An example workflow 
would involve using SampleNetwork to pre-process a 
microarray dataset, then using WGCNA [34] to identify 
modules of co-expressed genes, and finally using Modu- 
leSampleNetwork to explore sample network properties 
at the modular level. 

While both SampleNetwork and ModuleSampleNetwork 
are user-friendly, they are interactive and require judicious 
feedback from the user (for example, regarding thresholds 
for outlier removal). To illustrate how the software can be 
used in practice, we provide a detailed, annotated tutorial 
with R code (Additional file 3) highlighting the required in- 
put files, parameter choices, user interactions, and resulting 
output files. The beneficial effects of outlier detection and 
removal, data normalization, and correction for batch 
effects, as implemented using SampleNetwork, are clearly 
delineated by significance testing of sample covariates with 
respect to sample metrics, analysis of differential expres- 
sion, and analysis of network concepts with each successive 
round of data processing, as described in the online tutor- 
ial. This tutorial, (Additional file 3) along with the required 
input files and the SampleNetwork and ModuleSample- 
Network R functions, is available on our web site (http:// 
www.genetics.ucla.edu/labs/horvath/CoexpressionNet- 
work/SampleNetwork). 

Microarray data pre-processing 

Raw microarray data (.CEL files) [14] were downloaded 
from Gene Expression Omnibus (http://www.ncbi.nlm.nih. 
gov/geo/query/acc.cgi?acc=GSE3790). Detailed information 
on sample characteristics and sample processing can be 
found in [14]. A summary of sample characteristics can also 
be found in Additional file 2. To eliminate non-specific and 
mis-targeted probes prior to generating expression values, 
a mask file ("HG-U133A") was obtained from http:// 
masker.nci.nih.gov/ev/ [46] and applied to the raw micro- 
array data using the R (http://www.r-project.org/) package 
"ProbeFilter" [47] (http://arrayanalysis.mbni.med.umich. 
edu/MBNIUM.html#ProbeFilter). After applying the mask 
file, only probe sets with at least seven remaining probes 
were retained for further analysis (n = 18,631). Expression 
values were generated in R using the "expresso" function 
of the "affy" package (http://www.bioconductor.org/) [48] 
with "mas" settings and no normalization, followed by 
scaling of arrays to the same average intensity (200). 

Sample networks based on general similarity or 
dissimilarity measures 

The input of most clustering procedures is a similarity or 
dissimilarity measure. In Additional file 1, we define these 
measures and describe general approaches for turning a 
similarity or dissimilarity matrix into a sample network. 



Defining sample adjacency 

To construct sample networks, a measure of connection 
strength, or adjacency, is defined for each pair of samples i 
and ; and denoted by A mathematical constraint on 
is that its values must lie between 0 and 1. In our imple- 
mentation, we defined the adjacency between (microarray) 
samples S t and Sj as follows: 

2 J 

where (3 = 2. Technically, is a signed weighted adjacency 
matrix [22,49]. A major advantage of defining a network 
adjacency measure (as opposed to a general similarity 
measure) between samples is that it allows specification of 
network concepts (see below). Our proposed sample adja- 
cency measure (based on |3 = 2) also has several other 
advantages. First, it preserves the sign of the correlation 
(although in most applications negative correlations among 
samples are unlikely to occur). Second, it preserves the 
continuous nature of the correlation information; alterna- 
tive approaches based on thresholding the correlation coef- 
ficient may lead to information loss. Third, while any other 
power |3 could be used, the choice of (3 = 2 results in an ad- 
jacency measure that is close to the correlation when the 
correlation is large (e.g. larger than 0.6, which is often the 
case among samples in microarray data). 

We note that SampleNetwork also allows the user to de- 
fine sample adjacencies using Euclidean distance, which 
may be desirable in some applications. Future efforts may 
seek to compare the properties of sample networks using 
these and other adjacency measures. 

Network concepts 

After constructing an adjacency matrix, nodes (samples) 
can be characterized in terms of a number of existing net- 
work concepts (see refs. [10,12] for comprehensive over- 
views of network concepts). Several of these concepts are 
reviewed briefly below, including the connectivity (also 
known as degree in unweighted networks) and the cluster- 
ing coefficient, which we find to be particularly useful in 
the context of sample networks. 

Connectivity 

The connectivity (k) of the i-th network node is defined 
by: 

hi — a[j. 

The maximum connectivity is defined as: 

/<max = max (kj) . 
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The scaled connectivity K[ of the i-th network node is 
defined as: 

/v max 

The standardized connectivity Z.K X of the z-th network 
node is defined as: 

= /<Q-mean(/C) 
' ' y^K) 

Sample network interpretation of the connectivity: 
Using our proposed measure of sample adjacency 
(signed weighted network with |3 = 2), we find that 

k*! H cor(S h Sj) 

if all sample correlations are > 0.6. In other words, samples 
with high connectivity tend to be highly positively corre- 
lated with other samples. The connectivity is the most 
widely used concept for distinguishing the nodes of a net- 
work. As illustrated in the motivational example above 
and as detailed in our R tutorial (Additional file 3), sam- 
ples with low connectivity may represent outliers. 

Clustering coefficient 

The clustering coefficient (C) of node i measures the 
density of local connections, or "cliquishness" [11]. For 
weighted networks, 0 < a t j < 1 implies that 0 < Q < 1 [22] : 

{(luiauf-Zi^au) 2 } 

The standardized clustering coefficient Z,Q of the i-th 
network node is defined as: 

zc = Q-mean(C) 
V var(C) 

Sample network interpretation of the clustering coeffi- 
cient: The higher the clustering coefficient of a sample, 
the higher is the average pairwise correlation among its 
closest neighbors. If all of a samples closest neighbors 
have pairwise correlations of -1, the clustering coeffi- 
cient will be zero. 



refer to the density as the mean intersample adjacency 
(ISA). 

Sample network interpretation of the density: Using our 
proposed measure of sample adjacency (signed weighted 
network with (3 = 2), we find that 

... ZiZi*jCor(S h Sj) 

mean( A)~ — — — 

n(n-l) 

if all sample correlations are > 0.6. Thus, the mean adja- 
cency is roughly equal to the mean correlation in sample 
networks. 

The standardized C(/c) curve and cor(K,Q network concept 

Empirical results obtained through application of the Sam- 
pleNetwork R function to many datasets indicated that as 
outlying samples are removed, data are normalized, and 
technical artifacts (e.g. batch effects) are corrected, Z.K and 
Z.C exhibit a progressively linear, inverse relationship. A 
similar relationship has been observed in unweighted (bin- 
ary) networks, where the relationship between the (unstan- 
dardized) connectivity and (unstandardized) clustering 
coefficient of network nodes, i.e. the C(k) curve, has previ- 
ously been reported to follow a scaling law (C = k a 
[29,31]), with values approaching -1 often observed for the 
scaling exponent a in biological systems [30,32]. It has been 
suggested that this relationship may emerge as a conse- 
quence of hierarchically modular networks, where nodes 
with low connectivity form small, densely connected clus- 
ters, and nodes with high connectivity serve to bridge these 
many small clusters into one large, integrated network [31]. 

We define the standardized C(k) curve as a scatter plot 
between Z.K and Z.C where Z.K and Z.C denote the stan- 
dardized sample connectivity and the standardized sample 
clustering coefficient, respectively. We also introduce a new 
network concept, cor(/<r,C), which we define as the Spear- 
man correlation between Z.K and Z.C. We note that other 
measures of correlation could also be used (e.g. Pearson 
correlation). Since the Spearman correlation is invariant 
with respect to a monotonically increasing transformation 
(e.g. standardization), we find that 

cor(K^ C) = cor(Z.K 1 Z.C) = cor(k 1 C), 



Density and mean intersample adjacency (ISA) 

We find it useful to characterize sample networks using 
the mean (off-diagonal) adjacency measure, i.e. 



mean(A) 



n(n-l) 



where A = [a i; ]. The mean adjacency is also known as the 
density of the network. In sample networks, we often 



where k denotes the unsealed connectivity. As described in 
Results, we find that cor(K,C) is inversely related to the 
density (i.e. mean adjacency) in simulated networks. How- 
ever, because cor{K,C) is invariant if one scales all off- 
diagonal adjacencies by a constant, it is more accurate to 



Heterogeneity 



\/var(k) 
mean(k) 
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consider cor(K,C) as an indicator of network heterogeneity. 
The network concept Heterogeneity is defined as: 

Let us briefly consider the special case of an exactly 
factor izable network in which the network adjacency 
factors into node-specific contributions (a^ = CF(/) CF(/)) 
[10,50]. In this case, we have shown that the Spearman 
correlation cor(K,C) is actually determined by the net- 
work heterogeneity: 



cor(K, C) « 0.96-2.19 



Ik 



nLkf 



0.96-2.19- 



1 



1 + Heterogeneity 2 



Thus, cor(K,C) close to 1 indicates that network het- 
erogeneity is high. Divergence of cor(K,C) from 1 (in a 
negative direction) implies increasing homogeneity; once 
a critical level of homogeneity in the network is breached 
(analogous to a percolation transition [33]), cor(K,C) 
becomes negative. In practice, however, the relationship 
described above does not generalize to non-factorizable 
networks. In our real data applications that involve non- 
factorizable networks, cor(K,C) also exhibits a dependence 
on the network size n. 

Identification of significant differences between cor(K,C) 

Differences in standardized C(k) curves may distinguish 
biologically interesting groups of samples. For example, 
assume two sample networks (corresponding to two 
groups of samples) and two corresponding measures of 
cor{K,C). To identify significant differences in cor(K,C) 
between two sample networks, we use a test for asses- 
sing the significance of differences in correlations from 
samples of different sizes. First, cor{K>C) for each sample 
group is transformed using the Fisher transformation: 



z k = 0.5 * log 



1 + cor(K, C) l 
l-cor(K,C) k 



where k indexes the sample networks being compared. 
For the comparison between groups (sample networks) 
1 and 2, the difference between the resulting ^-scores is 
divided by the joint standard error: 



Zdiff 



z\-z 2 



(«i-3) (« 2 -3) 



where n x and n 2 represent the number of samples in 
groups 1 and 2, respectively. Under the null hypothesis 
of equal cor(K,C), z di ff follows asymptotically a normal 
distribution (under weak assumptions). Therefore we 
calculate significance levels (P-values) for Zdtff based 
upon the standard normal distribution. 



Simulation model for illustrating the ability of cor(K,Q to 
distinguish sample groups in the absence of differential 
expression 

To further illustrate the utility of cor{K,C), we simulated a 
set of 500 genes (referred to as a "module") with the fol- 
lowing properties: i) the first principal component (the 
observed module eigengene [ME]) exhibited no relation- 
ship to a simulated sample trait (referred to as "disease sta- 
tus"), and ii) cor(K,C) distinguished "control" subjects from 
those with "moderate" or "severe" disease status. The mod- 
ule was simulated to contain two unrelated sub-modules of 
200 and 300 genes, respectively. The first sub-module con- 
tained a signal for the simulated sample trait, while the sec- 
ond sub-module contained noise genes with no relation to 
disease status. The first sub-module was simulated in two 
steps. First, we used a seed ME as input for the simulate- 
Module function from the R package WGCNA [34]. This 
function simulates genes with varying correlations around 
the seed ME and exports standardized gene expression 
values (i.e. each gene has mean = 0 and variance = 1). Sec- 
ond, we added a mean value to each module gene. Import- 
antly, the mean gene expression values depended on the 
value of the seed ME. For subjects whose seed ME values 
were above the median, mean expression values were 
drawn from a normal distribution with mean = 2 and 
standard deviation = 2. For subjects whose seed ME values 
were below the median, mean expression values were 2/3 
those of the control subjects (i.e. it was assumed that the 
disease lowered the mean gene expression values in sub- 
module 1). Analogously, we simulated the expression 
values for the second sub-module. However, we assumed 
that the mean gene expression values were derived from a 
normal distribution with mean = 2/3 and standard devi- 
ation = 2/3 (i.e. the mean values of these genes tended to 
have lower expression values than those of the first sub- 
module). The sample trait was simulated by thresholding 
the seed ME of the first sub-module. We assumed that 
healthy control subjects have a high value of the seed ME. 
Specifically, we simulated 100 individuals, with 50 desig- 
nated as "control" subjects (darkgreen), 25 designated as 
"moderate" disease status (red), and 25 designated as "se- 
vere" disease status (turquoise), as indicated in Figure 5. In 
practice, the seed ME was not known. Instead, the 
observed ME for the entire module was obtained as the 
first principal component of the set of 500 genes. 

Additional network concepts for sample networks 

In addition to characterizing sample networks via the con- 
nectivity and the clustering coefficient, it is also possible to 
characterize sample networks using additional network 
concepts. Such concepts include decentralization and 
homogeneity, as well as summaries of node-based measures 
such as the mean correlation, mean connectivity, mean 
clustering coefficient, mean inter sample adjacency (or 
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density), and mean maximum adjacency ratio (MAR). 
When applied to sample networks, these concepts provide 
a battery of measures for comparing the consistency of 
sample behavior within and across datasets. These network 
concepts are calculated automatically by SampleNetwork 
and are discussed further in Additional file 1 and our R tu- 
torial (Additional file 3). 

Differential expression analysis 

To determine whether specific CN gene co-expression mod- 
ules were associated with DE in HD, for each CN module 
we calculated the ME (i.e. the first principal component 
obtained by singular value decomposition), which is a vector 
that summarizes the characteristic expression pattern of a 
module [10]. We then used Student s ^-test to determine 
whether the mean expression levels of the ME differed be- 
tween groups (distinguished by HD diagnosis). An advantage 
of this approach is that the extent of modular DE can be 
summarized by a single P-value. Future efforts may seek to 
incorporate higher-order representative features (beyond the 
first principal component) to explore additional relationships 
between gene co-expression modules and disease status [51]. 
Differential gene expression in CN between CTRL and HD 
subjects (Additional file 4) was assessed using Students £-test 
on log 2 -transformed expression values. The resulting P- 
values were corrected for multiple comparisons by control- 
ling for the false-discovery rate [52]. The resulting local 
false-discovery rates (referred to as Q-values), along with 
mean expression levels for CTRL and HD, are reported for 
all genes in the salmon module in Additional file 4. 

Ingenuity pathways analysis 

Ingenuity Pathways Analysis (IPA; http://www.ingenuity. 
com/) was used to determine whether gene co-expression 
modules identified in [21] were enriched with functional 
interactions among their constituent genes. For each mod- 
ule, probe sets that were positively correlated with the 
module eigengene (P < 0.001) were used to search for en- 
richment. Network construction was restricted to experi- 
mentally verified, direct physical interactions. IPA reported 
false-discovery rate (FDR) -corrected P-values for the 500 
most enriched functionally annotated categories of genes 
in each module. Results for the salmon module are 
reported in Additional file 5. 

Additional files 



Additional file 1: Supplementary information. PDF file containing 
Supplementary Methods, Supplementary References, and Supplementary 
Figures (1-8). 

Additional file 2: Sample information. XLS table that summarizes 
sample information, including Gene Expression Omnibus (GEO: 
http://www.ncbi.nlm.nih.gov/geo/) sample ID, sample labels, diagnosis, 
severity grade, age, sex, individual ID, hybridization date, hybridization 
batch assignment, and country of processing, as described in ref. [14]. 



Additional file 3: SampleNetwork R tutorial. DOC file containing 
annotated R code and detailed instructions for executing the 
SampleNetwork and ModuleSampleNetwork R functions. The datasets 
that are referenced in the tutorial and analyzed in this study can be 
downloaded from: http://www.genetics.ucla.edu/labs/horvath/ 
CoexpressionNetwork/SampleNetwork. 

Additional file 4: Summary of differential expression, functional 
enrichment, and module membership for genes in the salmon 
module. XLS table that summarizes the extent of differential expression, 
functional enrichment, and membership strength for the salmon module. 
Differential expression analyses include CTRL vs. HD human caudate 
nucleus samples [14], CTRL vs. HD human laser-microdissected striatal 
neurons [14], and wild-type mice vs. a mutant mouse model of HD 
cultured primary striatal neurons [20]. Functional enrichment categories 
included G-protein coupled receptors, phosphatidylinositol signaling, 
calmodulin binding, ion transport, and calcium ion binding; all of these 
categories were significantly enriched in the salmon module [21]. Module 
membership values and corresponding P-values are taken from ref. [21]. 

Additional file 5: Ingenuity Pathways Analysis of salmon module 

genes. XLS table that reports false-discovery rate (FDR)-corrected P- 
values for the 500 most enriched functionally annotated categories of 
genes in the salmon module from Ingenuity Pathways Analysis (IPA; 
http://www.ingenuity.com/). 
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